Morpho-syntactic ambiguity and tagset design for Hungarian
Authors
Abstract
The paper reports on work in progress to develop a tag set for Hungarian. The rich morphological structure of the language makes tagging feasible only after a full-scale morphological analysis, which yields a very large number of patterns that do not translate easily into a corpus tag set of manageable size. The paper analyses the extent and types of morpho-syntactic ambiguity found in a 21-million-word sample of the Hungarian National Corpus as a preparatory stage in establishing a practical tag set.
Similar resources
Principled Hidden Tagset Design for Tiered Tagging of Hungarian
For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon descriptively tends to rise quite rapidly, approaching a thousand or more distinct codes. For the automatic disambiguation of arbitrary written texts, using such large tagsets would raise many problems, starting from implementa...
An Overview of Data-Driven Part-of-Speech Tagging
Over the last twenty years or so, approaches to part-of-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways to avoid the data sparseness threat in standard ...
Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging
The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagset...
High Accuracy Tagging with Large Tagsets
The paper presents experiments and results related to morpho-syntactic (MS) tagging of a highly inflectional language, based on combining language models (LM) learnt from multiple register-diversified corpora. To cope with a large tagset (614 tags), our underlying tagger uses a hidden smaller tagset (92 tags), mapped back, after the proper tagging, into the initial tagset. The same text is tagg...